Introduction

The Study

This study looked at body temperature distributions of diurnal desert lizards across three continents (North America, Africa, and Australia). Data was gathered over several decades and later recycled for use in this study. Body temperature distributions were expected to exhibit left-skewness. This study assessed whether skewness would differ between phylogeny, location, body size, or median body temperature, but no significant differences were found.

Data Analyses

Density plots were constructed for each African Kalahari desert lizard species and standardized so the modal temperature was set to zero. This allowed for comparison of skewness of body temperature between species. This also allowed visualization of left skewness with supporting data showing that seven species exhibited significant left skewness.

Supporting data included mean body temperature, median body temperature, mean absolute deviation, minimum and maximum body temperatures, number of samples per group, and the D’Agostino skewness coefficient and the related p value. Minimum sample size for inclusion was N = 23. Calculations were run separately for each season as well as continent/latitudinal area. Data for Australia and Africa was able to be pooled, but North American data spanned 14 degrees of latitude and needed to be separated into two regions because this can bias skewness. It was noted that the D’Agostino skewness coefficient is very sensitive to sample size and did not provide significant results for any species with N < 50. However, these were included due to the designation of N = 23 as the minimum sample size. One-tailed tests were used due to the hypothesis that all samples would be left-skewed.

Phylogenetic analysis was not done because they were using older data which did not include enough deserts/species with overlapping clades to be appropriate for phylogenetic analysis. Essentially, the preliminary requirements were not met for a full factorial ANOVA.

Preliminaries

Install {moments}.

Load the following packages:

> library(curl)
## Using libcurl 7.64.1 with LibreSSL/2.8.3
> library(ggplot2)
> library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
## x readr::parse_date() masks curl::parse_date()
> library(dplyr)
> library(moments)

Figure 1 Replication

Figure 1 Caption

Density plots of Tb of 10 Kalahari lizard species in summer: distributions are centred on modal temperature of each species. Left skewness is visually evident for many species and significant in seven (Table S1)

Step 1: Load the Data

> f <- curl("https://raw.githubusercontent.com/maekell98/nkelley-data-replication-assignment/main/doi_10-5/KalahariTbData.csv")
> kl <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
> head(kl)
##   area number latitude        species   Tb  time     date    family
## 1    L  15065   -26.38 Agama aculeata 35.2 12:25 11/28/69 Agamdidae
## 2    L  15087   -26.38 Agama aculeata 38.0 16:28 11/28/69 Agamdidae
## 3    K  15092   -25.75 Agama aculeata 37.2 10:57 11/29/69 Agamdidae
## 4    L  15123   -26.38 Agama aculeata 35.6 17:40 11/28/69 Agamdidae
## 5    K  15141   -25.75 Agama aculeata 37.0 12:50 11/30/69 Agamdidae
## 6    K  15144   -25.75 Agama aculeata 38.0 15:30 11/30/69 Agamdidae

Step 2: Clean up the data and select the correct subset for analysis

Note: Southern latitude summer = December through February. Therefore, we must subset the data by date to include only data collected in the summer months.

“We arbitrarily set a minimum sample size of N = 23 Tb for inclusion.” Since Trachylepis variegata does not have this many samples, it will be excluded from further analysis at the end of the the following chunk.

> #in order to subset by date, dates must be coerced into a format recognized by r:
> #date formatting: https://stackoverflow.com/questions/38146160/how-to-convert-dd-mm-yy-to-yyyy-mm-dd-in-r
> kl$date <- as.Date(kl$date, "%m/%d/%Y")
> 
> #kls = kalahari lizards summer subset
> #date subsetting: https://www.statology.org/subset-by-date-range-in-r/
> kld <- kl[kl$date >= "0069-12-02" & kl$date <= "0070-02-27", ]
> 
> #not enough data for species Trachylepis spilogaster, deleting for further analyis
> kls <- subset(kld, species!="Trachylepis variegata")
> head(kls)
##    area number latitude        species   Tb  time       date    family
## 7  <NA>  15166   -27.07 Agama aculeata 38.4 14:18 0069-12-02 Agamdidae
## 9  <NA>  15192   -27.07 Agama aculeata 40.4 14:33 0069-12-02 Agamdidae
## 10 <NA>  15194   -27.07 Agama aculeata 36.2 14:54 0069-12-02 Agamdidae
## 11 <NA>  15233   -26.85 Agama aculeata 37.6 11:00 0069-12-04 Agamdidae
## 12 <NA>  15241   -26.80 Agama aculeata 37.8  8:50 0069-12-05 Agamdidae
## 13 <NA>  15242   -26.80 Agama aculeata 34.8  8:55 0069-12-05 Agamdidae

Step 3: Create Subsets to be Plotted

In order to obtain density plots by species, we first subset by species. Each subset is mutated so they are standardized for comparison with each other. This involves subtracting the body temperature mode from each body temperature measurement which essentially sets the mode for each sample to 0 without changing the relationship between the individual points. I was having a really hard time at first figuring our how they centered their distributions around the modal temperatures of each species. Ultimately, this was my solution. My plot differs somewhat from theirs, so this might be a source of that discrepancy. However, I don’t see a reason why my solution shouldn’t work.

> #used unique to quickly get a list of species in the kl data set
> unique(kld$species)
##  [1] "Agama aculeata"            "Heliobolus lugubris"      
##  [3] "Meroles squamulosus"       "Meroles suborbitalis"     
##  [5] "Nucras tessellata"         "Pedioplanis namaquensis"  
##  [7] "Pedioplanis_lineoocellata" "Trachylepis occidentalis" 
##  [9] "Trachylepis sparsa"        "Trachylepis spilogaster"  
## [11] "Trachylepis variegata"
> #define function for calculating mode: https://r-lang.com/mode-in-r/
> getmode <- function(v) {
+  uniqv <- unique(v)
+  uniqv[which.max(tabulate(match(v, uniqv)))]
+ }
> 
> #now we create a subset for each species in which the mode of each Tb is adjusted to 0 and represented by column Tbm
> ss1 <- subset(kls, species == "Agama aculeata") %>%
+   mutate(Tbm = Tb - getmode(Tb))
> 
> ss2 <- subset(kls, species == "Heliobolus lugubris") %>%
+   mutate(Tbm = Tb - getmode(Tb))
> 
> ss3 <- subset(kls, species == "Meroles squamulosus") %>%
+   mutate(Tbm = Tb - getmode(Tb))
> 
> ss4 <- subset(kls, species == "Meroles suborbitalis") %>%
+   mutate(Tbm = Tb - getmode(Tb))
> 
> ss5 <- subset(kls, species == "Nucras tessellata") %>%
+   mutate(Tbm = Tb - getmode(Tb))
> 
> ss6 <- subset(kls, species == "Pedioplanis lineoocellata") %>%
+   mutate(Tbm = Tb - getmode(Tb))
> 
> ss7 <- subset(kls, species == "Pedioplanis namaquensis")%>%   mutate(Tbm = Tb - getmode(Tb))
> 
> ss8 <- subset(kls, species == "Trachylepis occidentalis") %>% mutate(Tbm = Tb - getmode(Tb))
> 
> ss9 <- subset(kls, species == "Trachylepis sparsa") %>%
+   mutate(Tbm = Tb - getmode(Tb))
> 
> ss10 <- subset(kls, species == "Trachylepis spilogaster") %>% mutate(Tbm = Tb - getmode(Tb))

Step 4: Create ggplot

Now we can run the ggplot. In order to show all the species subsets on the same plot, we add geom_density() with each of the different subsets. I feel there may have been a more stream-lined way to do this, but this did not take me too long either.

> #constructing density  plots: http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization
> s <- ggplot(ss1, aes(x=Tbm)) + 
+   geom_density() +
+   geom_density(data = ss2) +
+   geom_density(data = ss3) +
+   geom_density(data = ss4) +
+   geom_density(data = ss5) +
+   geom_density(data = ss6) +
+   geom_density(data = ss7) +
+   geom_density(data = ss8) +
+   geom_density(data = ss9) +
+   geom_density(data = ss10) +
+   #add mean line:
+   geom_vline(aes(xintercept=0),
+             color="black", linetype="dashed") +
+   #classic theme seems most appropriate (removes grid background, adds tick marks, black and white)
+   theme_classic() +
+   #set axes:
+   xlim(-15, 5) + ylim(0, 0.5)+
+   #set tick marks:
+   scale_y_continuous(breaks = c(0.0, 0.1, 0.2))+
+   scale_x_continuous(breaks = c(-15, -10, -5, 0, 5))+
+   #add labels,
+   #code for degrees celsius symbol: https://stackoverflow.com/questions/51799118/writing-the-symbol-degrees-celsius-in-axis-titles-with-r-plotly/51799161 
+   labs(title="Density plots Kalahari lizards, summer",x="Body temperature (\u00B0C), centered on modal temperature", y = "Density")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
> s

We can see from the figure that most Kalahari lizard species body temperature distributions exhibit left skewness. This relationship was also observed in the original article figure 1.

Comparison with Original Figure 1

For comparison, we have the origial plot:

Figure 3 Replication

Figure 3 Caption

Body temperature skewness (jittered) for desert lizards, with values by family (or by subfamily for Agamidae) and by continent. Three families are represented by one species. No pattern (desert, taxon) is evident.

Step 1: Load data sets (by country) and calculate statistics

Data was separated into three sets: Africa, Australia, and North America. Below, we load each data set and calculate the statistics for each. I used “kl”, “al”, and “nl” to name the data sets, representing “kalahari lizards”, “australia lizards”, and “north american lizards”. This syntax is retained throughout.

> f <- curl("https://raw.githubusercontent.com/maekell98/nkelley-data-replication-assignment/main/doi_10-5/KalahariTbData.csv")
> kl <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
> kl['desert']='Africa'
> head(kl)
##   area number latitude        species   Tb  time     date    family desert
## 1    L  15065   -26.38 Agama aculeata 35.2 12:25 11/28/69 Agamdidae Africa
## 2    L  15087   -26.38 Agama aculeata 38.0 16:28 11/28/69 Agamdidae Africa
## 3    K  15092   -25.75 Agama aculeata 37.2 10:57 11/29/69 Agamdidae Africa
## 4    L  15123   -26.38 Agama aculeata 35.6 17:40 11/28/69 Agamdidae Africa
## 5    K  15141   -25.75 Agama aculeata 37.0 12:50 11/30/69 Agamdidae Africa
## 6    K  15144   -25.75 Agama aculeata 38.0 15:30 11/30/69 Agamdidae Africa
> #Basic R script for calculating stats describing Tb distributions:
> klskew <- kl %>%
+     group_by(family) %>%
+     summarise(
+         meanTb = round(mean(Tb, na.rm = TRUE), digits = 2),
+         medTb  = median(Tb,na.rm = TRUE),
+         MAD = round(mad(Tb, na.rm = TRUE), digits = 2),
+         minTb = min(Tb, na.rm = TRUE),
+         maxBT = max(Tb, na.rm = TRUE),
+         nBT = n(),
+         SkewCoef = agostino.test(Tb, alternative = "greater")$statistic[1],
+         SkewP = agostino.test(Tb, alternative = "greater")$p.value,
+         desert = unique(desert))
> g <- curl("https://raw.githubusercontent.com/maekell98/nkelley-data-replication-assignment/main/doi_10-5/AustraliaTbData.csv")
> al<- read.csv(g, header = TRUE, sep = ",", stringsAsFactors = FALSE)
> al['desert']='Australia'
> #Here I rename a row because the other two data sets use Tb as the column name for blood temperature but this one for some reason used "BT". Renaming allows for later analysis. 
> names(al)[4]<-paste("Tb")
> head(al)
##   area number                   species   Tb  time     date  microhabitat
## 1    E  10448           Varanus eremius 41.0 14:35 01/20/66   terrestrial
## 2 <NA>   9828   Ctenophorus reticulatus 31.8 15:15 10/03/66 semi-arboreal
## 3 <NA>   9830   Ctenophorus reticulatus 31.6 15:24 10/03/66 semi-arboreal
## 4 <NA>   9832 Lophognathus longirostris 37.6 11:15 10/04/66      arboreal
## 5 <NA>   9834  Ctenophorus caudicinctus 38.0 14:00 10/05/66    saxicolous
## 6 <NA>   9835      Ctenophorus isolepis 39.2 16:00 10/05/66   terrestrial
##      family    desert
## 1 Varanidae Australia
## 2  Agamidae Australia
## 3  Agamidae Australia
## 4  Agamidae Australia
## 5  Agamidae Australia
## 6  Agamidae Australia
> alskew <- al %>%
+     group_by(family) %>%
+     summarise(
+         meanTb = round(mean(Tb, na.rm = TRUE), digits = 2),
+         medTb  = median(Tb,na.rm = TRUE),
+         MAD = round(mad(Tb, na.rm = TRUE), digits = 2),
+         minTb = min(Tb, na.rm = TRUE),
+         maxBT = max(Tb, na.rm = TRUE),
+         nBT = n(),
+         SkewCoef = agostino.test(Tb, alternative = "greater")$statistic[1],
+         SkewP = agostino.test(Tb, alternative = "greater")$p.value,
+         desert = unique(desert))
> h <- curl("https://raw.githubusercontent.com/maekell98/nkelley-data-replication-assignment/main/doi_10-5/NorthAmericanTbData.csv")
> nl <- read.csv(h, header = TRUE, sep = ",", stringsAsFactors = FALSE)
> nl['desert']='North Amer.'
> head(nl)
##   site number latitude    NS                species   Tb time     date
## 1  NAN   5169     42.2 north    Aspidoscelis tigris 38.2 7:17 07/01/62
## 2  NAN   5170     42.2 north Phrynosoma platyrhinos 32.0 7:20 07/01/62
## 3  NAN   5171     42.2 north    Aspidoscelis tigris 38.0 7:29 07/01/62
## 4  NAN   5172     42.2 north       Uta stansburiana 36.0 7:34 07/01/62
## 5  NAN   5173     42.2 north       Uta stansburiana 33.2 7:43 07/01/62
## 6  NAN   5174     42.2 north       Uta stansburiana 35.4 7:44 07/01/62
##            family      desert
## 1         Teiidae North Amer.
## 2 Phrynosomatidae North Amer.
## 3         Teiidae North Amer.
## 4 Phrynosomatidae North Amer.
## 5 Phrynosomatidae North Amer.
## 6 Phrynosomatidae North Amer.
> nlskew <- nl %>%
+     group_by(family) %>%
+     summarise(
+         meanTb = round(mean(Tb, na.rm = TRUE), digits = 2),
+         medTb  = median(Tb,na.rm = TRUE),
+         MAD = round(mad(Tb, na.rm = TRUE), digits = 2),
+         minTb = min(Tb, na.rm = TRUE),
+         maxBT = max(Tb, na.rm = TRUE),
+         nBT = n(),
+         SkewCoef = agostino.test(Tb, alternative = "greater")$statistic[1],
+         SkewP = agostino.test(Tb, alternative = "greater")$p.value,
+         desert = unique(desert))

Step 2: Combine Data into One Data Frame

For analysis of all lizards in the same figure, we combine all the data frames.

> #bind data -- retain all columns but combine columns in common; NA for columns not in common
> #https://www.geeksforgeeks.org/combine-two-dataframes-in-r-with-different-columns/
> ld <- bind_rows(kl, nl, al)
> #oop. Maybe I didn't need this. I actually needed to combine the statistics data frames which I do in the next line.
> 
> #combine skew summaries:
> lskew <- bind_rows(nlskew, alskew, klskew)

Step 3: Create Dot Plot

Now we use a dot plot to show the skewness by lizard family and country of origin.

> #make a dot plot: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
> lplot <- 
+   ggplot(lskew, aes(x=SkewCoef, y=family, group=desert)) +
+   #specify shape/color of points: http://www.sthda.com/english/wiki/ggplot2-point-shapes
+   geom_point(aes(shape=desert, color=desert), size = 3) +
+   scale_shape_manual(values=c(25, 8, 17))+
+   scale_color_manual(values=c('red', 'blue', 'grey48'))+
+   #frustrated because I couldn't get the Africa triangle to fill. I think it would've worked if all the shapes were fillable, but since two were not, the one that is refuses to fill... I tried to get around this for quite a while, but did not find the solution. 
+   scale_fill_manual(values='red', 'blue', 'grey48')+
+   labs(x = "Skewness", y = "Families or subfamilies") +  
+   theme_classic() +
+   xlim(-2.0, 0.5) +
+   scale_x_continuous(breaks = c(-2.0, -1.5, -1.0, -0.5, 0.0, 0.5)) +
+    geom_vline(aes(xintercept=0),
+             color="black", linetype="dashed") +
+   xlim(-2.0, 0.5)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
> lplot

Comparison with Original Figure 3

For comparison, we have the original plot:

You can see my plot did not include the phylogenetic comparison. It also contains different families compared to the original plot. I was unable to find the families used in this plot when I was going through the data. I am not sure if they used other outside information to further classify families/subfamilies, but the lack of clarification made it impossible for me to truly replicate their figure.

Figure 4 Replication

Figure 4 Caption

Body temperature histograms (with density curves) for Kalahari lizards in summer. Each panel gives the species name, family, D’Agostino skewness coefficient with significance levels adjusted for multiple comparisons (p < .05, p < .01 and p < .001) and sample size.

Step 1: Prep Data Frame for Facet Annotations

I wanted to call the skew data frames to annotate the facets, but was having trouble getting the code to work. Instead, I worked around by taking the data from the skew data frames and creating a new data frame with the information I was going to use to annotate the facets.

> #to specify figure dimensions: https://stackoverflow.com/questions/39634520/specify-height-and-width-of-ggplot-graph-in-rmarkdown-knitr-output
> 
> #prep data frame for individual facet annotations:
> ann <- kls %>%
+   group_by(species) %>%
+   summarise(
+     Family = unique(family),
+     )
> sig <- c('-1.291 ***', '-0.182 ns', '0.162 ns', '-1.241 ***', '-1.152 **', '1.878 ***', '-0.350 ns', '-1.295 ***', '-1.490 ***', '-1.350 ***')
> n <- c('N = 75', 'N = 30', 'N = 83', 'N = 124', 'N = 38', 'N = 290', 'N = 69', 'N = 101', 'N = 73', 'N = 106')
> ann <- cbind(ann, sig, n)
> #n and sig can be found in skewTb data frame as well, but I was having trouble getting these onto my ggplots using just the skewTb df, so I created a new dataframe with the relevant information. 

Step 2: Make the ggplot

This ggplot was the most fun to make. I learned about using facets as well as quickly annotating facets using prepared information. I feel like this figure contains a lot of information but the code appears very efficient. I had no idea where to begin, but luckily I stumbled upon facets at just the right time! I started by focusing on formatting one ggplot with the histogram and density plot. I was just starting to annotate it and was about to copy and paste for each species when I discovered facets. This saved a lot of time.

> #code for ggplot histogram+density plot: http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization
> dph1 <- kls %>% 
+   ggplot(aes(x=Tb))+
+   geom_histogram(aes(y=..density..), alpha=0.5, position="identity", colour="black", fill="grey")+
+   geom_density(alpha=.2)+
+   theme_classic()+
+   #set axes:
+   xlim(25, 45) + ylim(0, 0.3)+
+   #facet_wrap(): https://ggplot2-book.org/facet.html
+   #arrange grid (by species) and coerce into 3x4 using nrow
+   facet_wrap(~species, nrow = 4)+
+   #add annotations using geom_text: https://r-graphics.org/recipe-annotate-facet
+   geom_text(data = ann,x = 29, y = 0.28, aes(label = Family))+
+   geom_text(data = ann,x = 29, y = 0.24, aes(label = sig))+
+   geom_text(data = ann,x = 29, y = 0.2, aes(label = n))+   #labels:
+   labs(title="Kalahari diurnal lizards--summer",x="Body temperature (\u00B0C)", y = "Density")
> dph1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Comparison with Original Figure 4

As you will see, my ggplots appears the same if you take your glasses off. If you look too closely, there are minor discrepancies. I wonder if this is due to differences in the data range selected. In the end, the overall conclusions are the same though.

Also, a final word on theme: I chose theme_classic() as the theme that most closely replicated their figures. However, they consistently have a thicker font/lines that I do. I googled ggplot themes and was exploring for a closer fit, but I did not find a better solution. However, overall, I am happy with the nearly perfect match that theme_classic() provided. I also would choose to use this theme on my own plots in the future.

Final Comments

Personal Takeaways

After this assignment, I finally feel comfortable using ggplot to accomplish my goals. I also feel that I was able to refine some of my coding style. While it’s not perfect, I think this project left me feeling much more confident in my coding abilities.

Comments on the Replication/Article

I found that this article provided enough information for me to replicate the most important analyses and come to the same conclusions as the authors. However, there was some important information missing such as the distinctions between families and subfamilies that they used to create their Figure 3. Ultimately, they came to the conclusion that the phylogenetic analysis was both inappropriate based on their data and when visualized did not provide any significant conclusions. Due to the fact that their data was not intended for phylogenetic analysis, I do not feel that being unable to recreate this analysis is particularly detrimental. However, it would still be preferable if they provided all information necessary to come to the same conclusions as they did. That said, I worked with several other papers/explored fairly extensively in DRYAD before settling on their paper and felt that their data was more clean and accessible to replication than many others before them.